<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
<html>
<head>
<title>org.terrier.indexing package</title>
<!--
Terrier - Terabyte Retriever 
Webpage: http://ir.dcs.gla.ac.uk/terrier 
Contact: terrier{a.}dcs.gla.ac.uk
University of Glasgow - Department of Computing Science
Information Retrieval Group
 
The contents of this file are subject to the Mozilla Public
License Version 1.1 (the "License"); you may not use this file except in
compliance with the License. You may obtain a copy of the
License at http://www.mozilla.org/MPL/

Software distributed under the License is distributed on an "AS IS"
basis, WITHOUT WARRANTY OF ANY KIND, either express or
implied. See the License for the specific language governing rights and
limitations under the License.

Copyright (C) 2004-2010 the University of Glasgow. All Rights Reserved.
-->
</head>
<body bgcolor="white">
<p>Provides classes and interfaces for indexing documents.
The code in this package is organised around three main abstractions:
the collection, the document and the indexer.</p>
<p>The first abstraction is a collection of documents. This can be 
a standard TREC test collection, or a connection to a database from 
which the documents are extracted.</p>

<p>The second abstraction is a document. An implementation 
of a collection iterates through its documents 
and returns them one at a time. Each document encapsulates the parser required
to extract the information to index. Implementations of documents are
provided for TREC documents, PDF documents and standard Microsoft Office
formats, such as MS Word, MS PowerPoint and MS Excel.</p>
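<p>The relationship between a collection and its documents can be sketched
as follows. This is a minimal illustration only: the names SketchCollection,
SketchDocument, PlainTextDocument, InMemoryCollection and CollectionSketch are
hypothetical stand-ins invented for this example, not the interfaces of this
package, and the whitespace tokeniser is an assumed, deliberately simple parser.</p>

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.Locale;

// Hypothetical, simplified stand-ins for the two abstractions.
// They are NOT the actual interfaces of this package.
interface SketchDocument {
    boolean endOfDocument();   // true when no terms remain
    String getNextTerm();      // next term extracted by the document's parser
}

interface SketchCollection {
    boolean nextDocument();        // advance; false when the collection is exhausted
    SketchDocument getDocument();  // the document most recently advanced to
}

// A document whose "parser" splits plain text on whitespace and lowercases it.
class PlainTextDocument implements SketchDocument {
    private final Iterator<String> terms;

    PlainTextDocument(String text) {
        List<String> parsed = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                parsed.add(token.toLowerCase(Locale.ROOT));
            }
        }
        this.terms = parsed.iterator();
    }

    public boolean endOfDocument() { return !terms.hasNext(); }

    public String getNextTerm() { return terms.next(); }
}

// A collection backed by an in-memory list of raw texts.
class InMemoryCollection implements SketchCollection {
    private final Iterator<String> texts;
    private SketchDocument current;

    InMemoryCollection(List<String> texts) { this.texts = texts.iterator(); }

    public boolean nextDocument() {
        if (!texts.hasNext()) {
            current = null;
            return false;
        }
        current = new PlainTextDocument(texts.next());
        return true;
    }

    public SketchDocument getDocument() { return current; }
}

class CollectionSketch {
    // Walks the collection one document at a time and gathers each document's terms.
    static List<List<String>> readAll(SketchCollection collection) {
        List<List<String>> result = new ArrayList<>();
        while (collection.nextDocument()) {
            SketchDocument doc = collection.getDocument();
            List<String> terms = new ArrayList<>();
            while (!doc.endOfDocument()) {
                terms.add(doc.getNextTerm());
            }
            result.add(terms);
        }
        return result;
    }

    public static void main(String[] args) {
        SketchCollection collection = new InMemoryCollection(
                Arrays.asList("Hello World", "Terrier indexes documents"));
        System.out.println(readAll(collection));
    }
}
```

<p>Keeping the parser inside the document means the loop in readAll does not
need to know which format it is reading; supporting another format would only
require another document implementation.</p>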

<p>The third abstraction is the indexer, the process that 
iterates through the documents of a collection and creates the 
necessary index data structures. Two indexers are implemented. The first
saves field information, if specified; the second also saves
position information.</p>
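<p>The difference between recording only term occurrences and recording
positions as well can be illustrated with a small in-memory inverted index.
This is a sketch under stated assumptions, not the indexers of this package:
the class IndexerSketch and its methods are hypothetical, documents are plain
strings identified by their list position, tokenisation is by whitespace, and
field information is omitted for brevity.</p>

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only; NOT the indexers provided by this package.
class IndexerSketch {

    // Assumed tokeniser: lowercase, split on whitespace.
    static List<String> tokenize(String text) {
        List<String> terms = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                terms.add(token.toLowerCase(Locale.ROOT));
            }
        }
        return terms;
    }

    // Frequency-only index: term -> (document id -> number of occurrences).
    static Map<String, Map<Integer, Integer>> indexFrequencies(List<String> docs) {
        Map<String, Map<Integer, Integer>> index = new TreeMap<>();
        for (int docId = 0; docId < docs.size(); docId++) {
            for (String term : tokenize(docs.get(docId))) {
                index.computeIfAbsent(term, t -> new TreeMap<>())
                     .merge(docId, 1, Integer::sum);
            }
        }
        return index;
    }

    // Positional index: term -> (document id -> zero-based positions of the term).
    static Map<String, Map<Integer, List<Integer>>> indexPositions(List<String> docs) {
        Map<String, Map<Integer, List<Integer>>> index = new TreeMap<>();
        for (int docId = 0; docId < docs.size(); docId++) {
            List<String> terms = tokenize(docs.get(docId));
            for (int pos = 0; pos < terms.size(); pos++) {
                index.computeIfAbsent(terms.get(pos), t -> new TreeMap<>())
                     .computeIfAbsent(docId, d -> new ArrayList<>())
                     .add(pos);
            }
        }
        return index;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("to be or not to be", "to index");
        System.out.println(indexFrequencies(docs));
        System.out.println(indexPositions(docs));
    }
}
```

<p>In this sketch the positional index costs more memory per posting, but it
makes phrase and proximity matching possible, because the frequency of a term
in a document can still be read off as the length of its position list.</p>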
</body>
</html>
